ABSTRACT

The monarch butterfly (Danaus plexippus) is emerging as a
model organism for studying the mechanisms of circadian
clocks and animal navigation, and the genetic underpinnings
of long-distance migration. The initial assembly of the
monarch genome was released in 2011, and the biological
interpretation of the genome focused on the butterfly's
migration biology. To make the extensive data associated
with the genome accessible to the general biological and
lepidopteran communities, we established MonarchBase
(available at http://monarchbase.umassmed.edu). The database
is an open-access, web-available portal that integrates all
available data associated with the monarch butterfly genome.
Moreover, MonarchBase provides access to an updated version
of the genome assembly (v3), upon which all data integration
is based. The integrated data include genes with systematic
annotation, as well as other molecular resources, such as
brain expressed sequence tags, migration expression profiles
and microRNAs. MonarchBase offers a variety of retrieval
methods for accessing data conveniently and for integrating
biological interpretations.

INTRODUCTION 

The eastern North American monarch butterfly (Danaus
plexippus) undergoes a spectacular long-distance migration
in the fall. The monarch has emerged as an excellent model
for investigating the general molecular and neural basis of
long-distance migration (1,2). The remarkable navigational
capabilities of monarchs are part of a genetic program that
is initiated in migrants; the butterflies that travel south
to Mexico are at least two generations away from the
previous generation of fall migrants (3). Fundamental to
decoding the genetic basis of the long-distance migration
has been the construction of the draft sequence of the
monarch genome (4).

The monarch genome and its transcriptome were sequenced
de novo using next-generation sequencing technologies (4).
The difficulty of assembling the genome from wild-caught
butterflies with potentially high heterozygosity was
overcome, thus allowing the construction of the initial
version of the monarch genome assembly (v1), which consisted
of 273 Mb with 16 866 protein-coding genes (4).

Although the original assembly was quite complete for gene
coverage, its quality was limited by small scaffold size
(N50 of 53 kb) and high redundancy (~10%). By implementing
new assembly strategies and new libraries, these
difficulties have been largely overcome, resulting in a
substantial improvement of the monarch butterfly assembly
(named v3): 90% of the 249 Mb assembled sequence is now
represented by 366 major scaffolds whose minimum length is
160 kb. The improved organization of the monarch genome
should allow more precise annotation work. Furthermore, it
provides a high-quality reference that will facilitate
future population genetic studies. For example, researchers
can now re-sequence other monarch populations or
non-migratory Danaus species to help identify migratory
genes.
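
The contiguity figures quoted above can be recomputed from a list of scaffold lengths. The following Python sketch is illustrative and is not the authors' pipeline; the function name nx_stats and the toy lengths are assumptions made for the example. For a given fraction it returns how many of the longest scaffolds are needed to cover that fraction of the assembly, and the length of the shortest scaffold among them. Note the naming caveat: the text above uses "N50" for the length, whereas Table 2 reports the length as L50 and the scaffold count as N50.

```python
def nx_stats(lengths, fraction):
    """Return (count, min_length) for the longest scaffolds that together
    cover at least `fraction` of the total assembled length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    # Guard against floating-point rounding when fraction is close to 1.
    return len(ordered), ordered[-1]


# Toy scaffold lengths (arbitrary units), for illustration only.
toy = [400, 300, 150, 100, 50]
print(nx_stats(toy, 0.5))  # count and minimum length at 50% coverage
print(nx_stats(toy, 0.9))  # count and minimum length at 90% coverage
```

Applied to the v3 scaffold lengths with a fraction of 0.9, a function of this kind would be expected to reproduce the "366 scaffolds, minimum length 160 kb" summary given in the text.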

MonarchBase was developed as a public database for readily
accessing the monarch genome, its proteome and related
biological processes. The growing amount of genomic data and
its continuous qualitative improvement necessitated a
centralized database to coordinate the inflow of monarch
genomic resources. Compared with public data repositories,
organism-specific databases provide the community with
specialized data sets, powerful retrieval interfaces, a
platform for extensive biological interpretations and a site
for the integration of a variety of previously dispersed
data types. MonarchBase serves not only researchers
interested in monarch butterfly biology and the biology of
the migration but also the wider lepidopteran community.
We report here the development of MonarchBase, its
components and the latest version of the monarch genome
assembly and its corresponding geneset.



RESULTS AND DISCUSSION 

Data content 

The current data content in MonarchBase is summarized
in Table 1.

Genome assembly 

Assembling genomes with potentially high levels of
polymorphism remains a challenge, because allelic variants
can be assembled as separate haplotype sequences, which
results in residual redundancy. The occurrence of residual
redundancy in initial genome assemblies has been reported in
several studies (8,12). To remove redundancy from the
initial monarch v1 assembly (4), we used both automated and
manual methods. In brief, for each duplicated pair of
sequences, identified by considering sequence identity and
sequencing depth, the shorter one was discarded. Suspicious
sequences that were detected in only one sequencing library
were also excluded. Paired-end sequencing libraries, from
200 bp to 20 kb (4), were aligned to the non-redundant
sequences, step by step, using BOWTIE2 (13). The local
alignment mode of BOWTIE2 helped us effectively map the
Roche 454 libraries (8 and 20 kb), which
were not as rigorously analyzed previously (4). Scaffolds
were subsequently constructed based on mapped linkages
using SSPACE v2.0 (14). The resulting assembly (v3)
consists of 5397 scaffolds spanning ~249 Mb (Table 1).
The monarch genome was previously estimated to be
0.29 pg by Feulgen image analysis (15). However, the
actual assembled genome size for many species is smaller
than their early estimated size (7,16,17), partly because of
the presence of heterochromatin, which is nearly impossible
to sequence and assemble (12). Compared with the
previous version, the latest monarch assembly shows a
substantial improvement in connectedness (Table 2). Gene
coverage in the new geneset (OGS2.0) is also increased,
although our initial version already showed good gene
coverage (Table 2). The monarch whole genome shotgun
project has been deposited at DDBJ/EMBL/GenBank under the
accession AGBW00000000. The version described in this
paper (v3) is the second version, AGBW02000000.

Genome annotation

We identified 25 Mb of sequence as repetitive sequences
and transposable elements for the v3 assembly, as
described for the v1 assembly (4). We applied a variety
of prediction methods to annotate repeat-masked scaffolds
and provide accurate gene models (Table 1). Five
ab initio prediction sets, including AUGUSTUS (23),
GeneMark (24), Genscan (25), GlimmerHMM (26) and
SNAP (27), were independently generated as described
earlier (4). Importantly, we added data from the recently
released geneset of the passion-vine butterfly Heliconius
melpomene (8) to help identify butterfly-specific genes.

Table 1. Data content in current version of MonarchBase

Genome reference
  Assembly (v3)                5397 scaffolds spanning 248.6 Mb genome with 6.7 Mb as gaps
  Repeat                       121 269 repetitive elements spanning 25.3 Mb genome
Gene repertoire
  Official geneset (OGS2.0)    15 130
  GLEAN consensus set          16 216
  Maker consensus set          13 744
  AUGUSTUS ab initio set       14 550
  GeneMark ab initio set       27 256
  Genscan ab initio set        12 921
  Glimmer ab initio set        23 898
  SNAP ab initio set           25 758
  RNAseq assembly              18 563 genes with 23 543 alternative transcripts
Annotation for OGS2.0
  Public databases a           12 943
  Lepidoptera genesets b       13 572
  GO term                      8120 genes assigned with 1539 GO terms
  InterPro domain              10 034 genes assigned with 5069 domains
  KO                           8157 genes assigned with 3856 KO terms
                               4708 genes assigned into 264 pathways
  Ortholog group               198 021 proteins from 15 species assigned into 34 392 ortholog groups
Non-coding RNAs
  MicroRNA                     116
  Transfer RNA c               379
  Ribosome RNA d               127
Other resources
  Brain ESTs                   9484
  ESTs with microarray data    9417

a Public databases used for annotating monarch genes include RefSeq (5), UniRef50 (6) and the non-redundant database of NCBI.
b Lepidopteran genesets include the Bombyx geneset (7) and the Heliconius geneset (8).
c tRNAs were predicted by tRNAscan-SE (9).
d rRNAs were predicted by RNAmmer (10) or the Rfam scan pipeline (11) following the default settings.

Table 2. Quality control of the latest monarch assembly v3 compared with v1 and the other lepidopterans

                                    Danaus v3   Danaus v1   Heliconius v1.1 a   Bombyx b
Assembly statistics c
  L50 (bp)                          715 606     53 032      194 302             3 998 728
  N50                               101         1138        345                 38
  L90 (bp)                          160 499     6262        38 051              60 675
  N90                               366         7140        1634                260
CEGMA analysis for 248 ultra-conserved CEGs present in genome d
  # Complete                        230         229         214                 195
  # Partial                         243         241         237                 241
Homologs in Drosophila geneset e
  # Recovered                       9655        9653        9539                9524
  Average coverage                  55.2%       54.5%       53.0%               52.8%
Homologs in Tribolium geneset f
  # Recovered                       11 015      11 017      10 915              10 983
  Average coverage                  63.8%       63.0%       61.9%               61.3%
Homologs in Bombyx geneset g
  # Recovered                       13 010      12 996      12 820              -
  Average coverage                  84.3%       83.1%       82.4%               -
Homologs in Heliconius geneset h
  # Recovered                       12 860      12 840      -                   -
  Average coverage                  86.5%       84.9%       -                   -

a The Heliconius assembly used here is the latest version available for downloading from http://butterflygenome.org/ as of June 1, 2012, though a better N50 value (277 kb) was reported on a linkage-based improved version (8), which was not available to us.
b The Bombyx assembly (7) was downloaded from SilkDB 2.0 (18).
c For quantitative statistics of assembly, N50 indicates that half of the total sequence in the assembly is presented by a total of N50 scaffolds of length more than or equal to the L50 size; in a similar way, N90 and L90 indicate how 90% of sequence is presented in the assembly.
d Statistics of the complete and partial presence of 248 ultra-conserved CEGs were calculated by CEGMA pipeline v2.4 following the default settings (19).
e Drosophila geneset r5.36 is from FlyBase (20) and only the longest protein per gene was used for analysis. Recovered queries were automatically calculated by GenBlastA (21) as follows: genblasta_v1.0.4_linux_x86_64 -P blast -pg tblastn -p T -e 1e-5 -g T -f F -a 0.5 -r 1 -c 0.5; the output was then processed by a custom Perl script to sort out coverage on a single scaffold.
f Tribolium geneset 3.0 is from BeetleBase (22) and analyzed as described earlier.
g Bombyx geneset is from SilkDB 2.0 (18) and analyzed as described earlier.
h Heliconius geneset 1.1 (8) is from http://butterflygenome.org/ and analyzed as described earlier.



All these predicted genesets, together with the evidence
from monarch cDNAs and insect homology, were integrated by
GLEAN (28) to generate a consensus geneset. In addition, we
used the MAKER annotation pipeline (29) to build another
consensus geneset using the same inputs as used for GLEAN.
As a result, GLEAN and MAKER identified 16 216 and 13 969
genes, respectively. Based on an evaluation against 389
manually curated gene models and 20 cloned monarch genes,
we chose the non-redundant GLEAN set as our new reference
geneset. Both the GLEAN and MAKER sets, as well as all
other independent prediction genesets, remain available in
MonarchBase for browsing (Table 1).

A total of 15 130 of the 16 216 GLEAN genes, those whose
existence was supported by either monarch cDNAs or insect
homologs, were selected as the new official geneset
(OGS2.0) for comprehensive annotation (Table 1). We
performed BLASTP against both RefSeq (5) and
UniRef50 (6) databases to report annotation information.
We also performed both BLASTP and BLASTX against
the non-redundant NCBI database to help annotate
uncommon genes and pseudogenes.
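
The selection rule described above (keep a GLEAN gene if it has cDNA support or insect homolog support) amounts to a simple filter. The Python sketch below is illustrative only: the gene identifiers and the two evidence sets are made-up placeholders, and how support was actually scored is not specified here.

```python
def select_supported(gene_ids, cdna_supported, homolog_supported):
    """Keep genes backed by at least one line of evidence, preserving input order."""
    supported = set(cdna_supported) | set(homolog_supported)
    return [gene for gene in gene_ids if gene in supported]


# Hypothetical identifiers, for illustration only.
glean_genes = ["gene1", "gene2", "gene3", "gene4"]
official = select_supported(
    glean_genes,
    cdna_supported={"gene1", "gene3"},
    homolog_supported={"gene3", "gene4"},
)
print(official)  # gene2 has no supporting evidence and is dropped
```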

We used several methods to annotate genes into families
and pathways. A local InterProScan (30) was run against
the InterPro domain database (31) to map domains and
Gene Ontology (GO) terms (32) to monarch genes. KEGG
is well known for its collection of manually delineated
pathway maps representing the current state of knowledge
on molecular interactions and reactions (33). We
queried monarch proteins against KEGG orthology
(KO) using BLASTP (1e-5) and assigned them to biological
pathways. In addition, we used the OrthoMCL
algorithm (34) to analyze gene orthology among 15
species, as described (4), and clustered genes into
ortholog groups representing monarch-specific genes,
butterfly-specific genes (monarch and Heliconius) and
lepidopteran-specific genes (monarch, Heliconius and
Bombyx), as well as universal genes. For comparative
analysis, we performed multiple alignment for each
ortholog group using MUSCLE (35) and selected
well-aligned blocks using Gblocks (36).
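
As a rough illustration of the KO assignment step, the Python sketch below keeps, for each query protein, the best-scoring hit with an E-value at or below 1e-5. It assumes BLAST tabular output (12 tab-separated columns, with the E-value in column 11 and the bit score in column 12); the text does not state which output format was used, and the identifiers in the example are invented. Taking the highest bit score as the "best" hit is likewise an assumption.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Map each query ID to the subject of its highest-bit-score hit
    among hits whose E-value does not exceed max_evalue."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit fails the E-value cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Invented hits, for illustration only (KO_A etc. are placeholders).
example = [
    "prot1\tKO_A\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-50\t200",
    "prot1\tKO_B\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-20\t90",
    "prot2\tKO_C\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
]
print(best_hits(example))
```

In this toy input, prot2 receives no assignment because its only hit fails the cutoff.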

Functional resources 

Mapping monarch brain-derived expressed sequence tags
(ESTs) (37) to the geneset allowed all previously
identified transcripts associated with the oriented flight
behavior of migratory butterflies (38) to be annotated (4).
In addition, more than 7000 monarch genes have expression
data for comparison between summer and migratory
monarchs (38). Using an integration approach, we also
found an unexpected sexually dimorphic pattern within
the monarch juvenile hormone biosynthesis regulatory
pathway (4). RNAseq reads, representing multiple
monarch tissues and developmental stages (4), were
aligned back to the new assembly using Cufflinks (39) to
present alternative splicing patterns. A universal
expression value for each gene was calculated based on the
normalized transcriptome coverage, as described (4).
Small non-coding RNA sequencing data for both
summer and migratory butterflies (4) were also integrated
with the new assembly.

Database organization 

We store and manage data for MonarchBase using
MySQL (http://www.mysql.com). Several Common
Gateway Interface scripts were developed to process
users' input to search the database, connect to third-party
applications, parse the results and generate pages for
retrieved data. A schematic diagram of the database
organization is shown in Figure 1.
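
To make the request flow concrete, here is a minimal Python sketch of the kind of lookup a retrieval script performs: take a user's input, query a gene table by ID or by keyword, and return rows for display. It is not MonarchBase's code. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table name, columns, identifiers and records are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("DEMO0001", "per", "period circadian protein"),
            ("DEMO0002", "cry1", "cryptochrome 1"),
            ("DEMO0003", "cry2", "cryptochrome 2"),
        ],
    )
    return conn


def search_genes(conn, user_input):
    """Try an exact ID match first; otherwise fall back to a keyword search."""
    term = user_input.strip()
    rows = conn.execute(
        "SELECT gene_id, symbol, description FROM gene WHERE gene_id = ?",
        (term,),
    ).fetchall()
    if rows:
        return rows
    pattern = f"%{term}%"
    # SQLite's LIKE is case-insensitive for ASCII text by default.
    return conn.execute(
        "SELECT gene_id, symbol, description FROM gene "
        "WHERE symbol LIKE ? OR description LIKE ? ORDER BY gene_id",
        (pattern, pattern),
    ).fetchall()


conn = build_demo_db()
print(search_genes(conn, "DEMO0002"))      # lookup by ID
print(search_genes(conn, "cryptochrome"))  # lookup by keyword
```

The queries pass user input through placeholders rather than string concatenation, which is the usual way to keep web-facing search scripts safe from SQL injection.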

Genome browser 

MonarchBase uses a genome browser, implemented
with GBrowse 2.0 (40), to navigate annotations along
the genome assembly. GBrowse is a well-known
browser that combines a database with interactive web
pages for displaying genome annotations, and has
been applied to a variety of databases (18,22,41).
Through the MonarchBase GBrowse, researchers can
access data representing consensus genesets, independent
genesets, alternative splicing patterns, homolog and
cDNA alignments, repeat content, non-coding RNAs
and other genomic features.

Accurate prediction of gene models is the most important
task of genome annotation work. For consistency
among users, we provide, as already indicated, an
official reference geneset, OGS2.0, which is superior in
overall quality to each of the independent genesets.
Because each gene prediction program currently in use
has both strengths and weaknesses, displaying all
prediction sets is useful to optimize gene models when
there are conflicting overlaps between sets.

Retrieved data 

MonarchBase has been designed with several entry points
and accepts an entry ID, keywords or a sequence as input to
retrieve data for either a single gene or a group of genes
(Figure 1). The gene page is the core of MonarchBase; there,
researchers can access all related information for
each OGS2.0 gene, including gene symbol, genomic
position, evidence of monarch cDNA or insect
homology, gene family, biological pathway, ortholog
group, and nucleotide and deduced protein sequences
(Figure 1). Each entry in the gene page links to an
informative web page. MonarchBase can also return a list
of monarch genes, coupled with biological interpretation,
when retrieving entries for GO, InterPro, KO, ortholog
groups or pathways. In addition, users can browse lists
of differentially expressed ESTs and expanded/contracted
gene families.

BLAST server 

A local Basic Local Alignment Search Tool (BLAST) server
is one of the most useful entry points for a genomic
database. At MonarchBase, users can search against a
variety of monarch genome-wide data, including scaffolds,
contigs, genes and ESTs. We also packed 332 930 proteins
from genesets of 20 insect species into a single database,
which facilitates searches for homologs across most insect
orders. We used html4blast, a Bioperl module (42), to
customize BLAST output. Through extended links, users can
click on identifiers to retrieve relevant information
conveniently.

Broad application 

As monarchs are famous for their long-distance migration,
the biological interpretation of the genome has
focused on genes potentially involved in the migration.
We have manually annotated more than 1000 genes of
biological interest for monarch migration biology and
curated more than 100 chemoreception genes (4). With
the new assembly, we have updated these gene inventories
with OGS2.0 gene models; these are available for
browsing in MonarchBase. MonarchBase also includes
data from other insect species, which are integrated with
appropriate links to other databases. We also provide
lepidopteran-specific genes, microRNAs and contracted
or expanded gene families based on our analysis. Users
from other fields can also download multiple datasets
for use in their local comparative analyses. Detailed
instructions on how to use each component can be
found in the help file of MonarchBase.



[Figure 1 diagram. Components shown: Local BLAST, Genome browser, Gene page, ESTs, Monarch biology, Expression profile, microRNAs, Resources of other species, Ortholog group, Multiple alignment, InterPro domain, Gene ontology, Expanded and contracted gene family, KEGG orthology, Pathway.]



Figure 1. Schematic view of the components of MonarchBase and their connections. The green arrows represent the clickable connections between 
the components. Thin arrows represent the major entrances of MonarchBase accepting users' input to retrieve data: black arrows indicate the 
sequence inputs; blue arrows indicate ID inputs; red arrows indicate keyword inputs; and purple arrows indicate browsing menus. 
FUTURE DIRECTIONS

Population genomic studies for monarchs and other
Danaus species should be forthcoming. Identifying
variations will be useful for analyzing population
substructure and distribution rates, dating the migration
of the eastern North American population and eventually
uncovering candidate migratory genes.

The completeness and contiguity of the monarch 
genome assembly will be continuously improved as more 
genomic sequences become available. In addition, the 
manual curation of additional genes is ongoing and will 
be updated in MonarchBase. We encourage other research
groups to contribute annotations, curations and related
datasets via email (steven.reppert@umassmed.edu).
Suggestions and requests for additional functions are 
also welcome. 